Question 1: Multiple Linear Regression: A Beachday!

For this question, we’ll use data from the venerable RIKZ dataset (download here) as seen in Zuur 2009 and described a bit more here in Fieberg’s book. Briefly, it’s from a survey of invertebrates on Dutch beaches looking at zonation. We will look at a few variables:

1. Richness = species richness (i.e., the number of species counted).
2. NAP = height of the sample site relative to sea level. Lower values indicate sites closer to or further below sea level, which implies more time spent submerged.
3. exposure = an index composed of the surf zone, slope, grain size, and depth of the anaerobic layer.
4. grainsize = the size of sediment/sand grains in the sample.
5. Beach = a unique identifier of the beach sampled.

Step 1: Load libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
library(emmeans)
library(performance)
library(visreg)

Step 2: Set theme

custom_theme <- function(){ 
  font <- "Helvetica" # font selection
    
    theme_minimal() %+replace% # theme based on minimal with following replacements
    
    theme(
      
      panel.grid.major = element_blank(), # leave grids and axis ticks blank
      panel.grid.minor = element_blank(),    
      axis.ticks = element_blank(),
      axis.line = element_line(color = "black",
                               size = 1),
      panel.border = element_rect(color = "black",
                                  fill=NA,
                                  size=1),
      plot.title = element_text(family = font,            
                   size = 20,                
                   face = 'bold',            
                   hjust = 0.5, # move title to center horizontally
                   vjust = 2), # move title up a wee bit
      plot.subtitle = element_text(         
                   family = font,           
                   size = 15,
                   hjust = 0.5),               
      plot.caption = element_text(           
                   family = font,           
                   size = 10,                 
                   hjust = 1), # put caption in right corner
      axis.title = element_text(             
                   family = font,
                   face = 'italic',
                   size = 15),               
      axis.text = element_text(              
                   family = font,            
                   size = 10),               
      axis.text.x = element_text(            
                    margin = margin(t = 2, # top
                                    r = 2, # right
                                    b = 2, # bottom
                                    l = 2)) # left
    )
}

Step 3:

Load the data, select down to those variables (you might want to also retain Sample, but, eh), and visually inspect the data. Are there any problems you might see in the data? Things you might want to mutate or watch out for in thinking about building models? Or is everything copacetic for a model where NAP, exposure, and grainsize predict species richness?

inverts <- read.csv("/Users/nathanstrozewski/Documents/Everything/Adult Stuff/Education/M.S. Biology UMB/Courses/003_F2022/Biological Data Analysis/Homework/Homework #7/Data/rikz.csv")

str(inverts)
## 'data.frame':    45 obs. of  16 variables:
##  $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sample       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ angle1       : int  32 62 65 55 23 129 126 52 26 143 ...
##  $ angle2       : int  96 96 96 96 96 89 89 89 89 89 ...
##  $ exposure     : int  10 10 10 10 10 8 8 8 8 8 ...
##  $ salinity     : num  29.4 29.4 29.4 29.4 29.4 29.6 29.6 29.6 29.6 29.6 ...
##  $ temperature  : num  17.5 17.5 17.5 17.5 17.5 20.8 20.8 20.8 20.8 20.8 ...
##  $ NAP          : num  0.045 -1.036 -1.336 0.616 -0.684 ...
##  $ penetrability: num  254 227 237 249 252 ...
##  $ grainsize    : num  222 200 194 221 202 ...
##  $ humus        : num  0.05 0.3 0.1 0.15 0.05 0.1 0.1 0.1 0.15 0 ...
##  $ chalk        : num  2.05 2.5 3.45 1.6 2.45 2.5 1.85 1.7 2.3 2.6 ...
##  $ sorting1     : num  69.8 59 59.2 67.8 57.8 ...
##  $ Beach        : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Richness     : int  11 10 13 11 10 8 9 8 19 17 ...
summary(inverts)
##        X          Sample        week           angle1           angle2     
##  Min.   : 1   Min.   : 1   Min.   :1.000   Min.   :  6.00   Min.   :21.00  
##  1st Qu.:12   1st Qu.:12   1st Qu.:2.000   1st Qu.: 22.00   1st Qu.:32.00  
##  Median :23   Median :23   Median :2.000   Median : 32.00   Median :42.00  
##  Mean   :23   Mean   :23   Mean   :2.333   Mean   : 50.31   Mean   :57.78  
##  3rd Qu.:34   3rd Qu.:34   3rd Qu.:3.000   3rd Qu.: 55.00   3rd Qu.:89.00  
##  Max.   :45   Max.   :45   Max.   :4.000   Max.   :312.00   Max.   :96.00  
##     exposure        salinity     temperature         NAP         
##  Min.   : 8.00   Min.   :26.4   Min.   :15.80   Min.   :-1.3360  
##  1st Qu.:10.00   1st Qu.:27.1   1st Qu.:17.50   1st Qu.:-0.3750  
##  Median :10.00   Median :27.9   Median :18.77   Median : 0.1670  
##  Mean   :10.22   Mean   :28.1   Mean   :18.77   Mean   : 0.3477  
##  3rd Qu.:11.00   3rd Qu.:29.4   3rd Qu.:20.00   3rd Qu.: 1.1170  
##  Max.   :11.00   Max.   :29.9   Max.   :20.80   Max.   : 2.2550  
##  penetrability     grainsize         humus             chalk       
##  Min.   :151.8   Min.   :186.0   Min.   :0.00000   Min.   : 0.850  
##  1st Qu.:237.1   1st Qu.:222.5   1st Qu.:0.00000   1st Qu.: 2.200  
##  Median :256.1   Median :266.0   Median :0.05000   Median : 4.750  
##  Mean   :289.4   Mean   :272.5   Mean   :0.05028   Mean   : 7.961  
##  3rd Qu.:272.9   3rd Qu.:316.5   3rd Qu.:0.10000   3rd Qu.: 9.150  
##  Max.   :624.0   Max.   :405.5   Max.   :0.30000   Max.   :45.750  
##     sorting1          Beach      Richness     
##  Min.   : 53.08   Min.   :1   Min.   : 0.000  
##  1st Qu.: 72.75   1st Qu.:3   1st Qu.: 3.000  
##  Median : 89.03   Median :5   Median : 4.000  
##  Mean   : 97.82   Mean   :5   Mean   : 5.689  
##  3rd Qu.:115.44   3rd Qu.:7   3rd Qu.: 8.000  
##  Max.   :248.60   Max.   :9   Max.   :22.000
inverts <- inverts %>% 
  select(Sample, Richness, NAP, exposure, grainsize) # select vars

inverts[is.na(inverts) | inverts == Inf] <- NA # replace any infinite values with NA

inverts <- inverts %>% 
  drop_na() # remove NAs

Step 4: Visualize the data

This was not outlined in the homework instructions, but will inform me about certain mods I might need to make to my model, such as log transformation.

richbyNAP_plot <- ggplot(data = inverts,
                         mapping = aes(x = NAP,
                                       y = Richness,
                                       color = Sample)) +
  geom_point() + # model relationship between NAP and Richness
  custom_theme()

richbyNAP_plot

richbyexposure_plot <- ggplot(data = inverts,
                         mapping = aes(x = exposure,
                                       y = Richness,
                                       color = Sample)) +
  geom_point() + # model relationship between exposure and Richness
  custom_theme()

richbyexposure_plot

richbygrainsize_plot <- ggplot(data = inverts,
                         mapping = aes(x = grainsize,
                                       y = Richness,
                                       color = Sample)) +
  geom_point() + # model relationship between grainsize and Richness
  custom_theme()

richbygrainsize_plot

Exposure is discrete, while grainsize and NAP are continuous. There is no clear indication yet that any modifications to my model are needed.

Step 5: Model

Model richness as a function of exposure, grainsize, and NAP. Evaluate the assumptions of a linear model. If it does not meet them, either log(x+1) or asinh transform your response variable. It might still not be perfect, but, you’ll see why later…

inverts_lm <- lm(Richness ~ NAP + exposure + grainsize,
                data = inverts)

check_model(inverts_lm)

residualPlots(inverts_lm,
              test=FALSE)

The PPC looks okay, but not great: model-predicted richness is shifted higher than the observed data, which indicates the model isn’t an excellent fit. Aside from this, I’m only concerned about HOV - the reference line is not flat and horizontal. The residual plots otherwise look pretty good, with only a slight bow in the residuals vs. fitted values.

I am going to try asinh transforming. I’ll compare to the original model and see if any of the issues are resolved.
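A quick base-R sanity check on the transformation choice (nothing here depends on the fitted model): asinh(x) = log(x + sqrt(x² + 1)), so it behaves like a log for large values but, unlike log(), is defined at zero - relevant because Richness has a minimum of 0 in this data.

```r
# asinh(x) = log(x + sqrt(x^2 + 1)): log-like for large x, but defined
# at x = 0, which matters because Richness has a minimum of 0.
x <- c(0, 1, 10, 100)
stopifnot(all.equal(asinh(x), log(x + sqrt(x^2 + 1))))

round(asinh(x), 3)   # 0.000 0.881 2.998 5.298
round(log(x + 1), 3) # 0.000 0.693 2.398 4.615 (the log(x+1) alternative)
```

Either transform handles the zeros; log(x + 1) is the other option the instructions mention.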

Step 6: Model transformation

inverts_lm_asinh <- lm(asinh(Richness) ~ NAP + exposure + grainsize,
                       data = inverts)

check_model(inverts_lm_asinh)

While asinh transforming improves the PPC, it makes the normality and linearity worse while having little to no effect on HOV. The instructions seem to suggest that my model will look weird and that I should proceed anyways…I’m more comfortable proceeding with a clean PPC, so I will stick with the asinh transformed model going forward.

Step 7: Evaluate the model

What does your model mean? Tell us what the coefficients tell us - both unstandardized and standardized. How much variation in the data is explained by the predictors?

tidy(inverts_lm_asinh)
## # A tibble: 4 × 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  6.90         0.865     7.98    7.10e-10
## 2 NAP         -0.683        0.0773   -8.84    4.77e-11
## 3 exposure    -0.449        0.104    -4.32    9.60e- 5
## 4 grainsize   -0.00000312   0.00166  -0.00188 9.99e- 1
r2(inverts_lm_asinh)
## # R2 for Linear Regression
##        R2: 0.736
##   adj. R2: 0.717

All three predictors in this model (NAP, exposure, and grainsize) have negative coefficients, though the grainsize estimate is essentially zero (p ≈ 1), so only NAP and exposure meaningfully decrease richness. Some of this makes sense intuitively - for example, invertebrates at more exposed sites may be picked off by predators, decreasing richness.

The R2 for the model is 0.736. An R2 of 1.0 would indicate that all the variance in the response is explained by the predictors, while an R2 of 0 would indicate that none of it is. The R2 of 0.736 here suggests that much, but not all, of the variance in Richness is explained by NAP, exposure, and grainsize - there’s something else this model doesn’t capture.
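The question also asks about standardized coefficients, which put predictors with different units on a common scale. A minimal sketch of the two equivalent routes - refit on z-scored variables, or rescale each slope by sd(x)/sd(y) - illustrated on the built-in mtcars data (not the inverts model) so it runs standalone:

```r
# Standardized coefficients two ways, sketched on built-in mtcars;
# the same recipe would apply to the inverts model.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
fitz <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)

b  <- coef(fit)[-1]                                  # unstandardized slopes
bz <- b * sapply(mtcars[c("wt", "hp")], sd) / sd(mtcars$mpg)

all.equal(unname(bz), unname(coef(fitz)[-1]))        # TRUE: both routes agree
```

A standardized slope answers “how many SDs does the response move per SD of this predictor,” which makes NAP, exposure, and grainsize directly comparable.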

Step 8: Plot model

Show us a plot that teases out the independent contribution of each predictor. What does it tell you that just looking at the coefficients above did not. Or does it not tell you anything new?

richness_predict <- augment(inverts_lm_asinh,
                            interval="confidence") %>% 
  rename(Richness = "asinh(Richness)")

richness_prediction_plot <- ggplot(data = richness_predict,
                                   mapping = aes(x = NAP,
                                                 color = grainsize)) +
  geom_point(mapping = aes(y = Richness)) +
  geom_line(mapping = aes(y = .fitted)) +
  geom_ribbon(mapping = aes(ymin = .lower,
                            ymax = .upper),
              fill = "lightgrey",
              alpha = 0.5,
              color = NA) +
  facet_wrap(vars(exposure)) +
  scale_color_gradient(low = "black",
                       high = "magenta") +
  custom_theme()
richness_prediction_plot

Step 9: Cool viz

Based on all of the above, construct some cool visualization on the original data scale that tells you something useful and interesting about what determines species richness on a beach.

richness_coolviz <- ggplot(inverts,
                           mapping = aes(x = NAP,
                                         y = Richness,
                                         color = as.factor(exposure))) +
  geom_line() +
  geom_point() +
  custom_theme() +
  labs(title = "Influence of NAP on Richness",
       subtitle = "By Exposure Type",
       x = "NAP",
       y = "Richness",
       color = "Exposure")

richness_coolviz

Question 2: Multiple Categorical Variables

To examine how models with multiple categorical variables work, let’s take a look at good ole’ palmerpenguins! In this data, we’ll look at the effects of species, sex, and year (as a categorical variable!) on bill depth.

Step 1:

To begin with, load up the data, and filter out anything with an NA for sex and making sure year is no longer continuous. Then, visualize the data, just to get an idea of what is going on here. Do you notice anything?

library(palmerpenguins)
library(dplyr)

data(penguins)

penguins <- penguins %>% 
  filter(!is.na(sex)) %>% 
  mutate(year = as.factor(year))
str(penguins)
## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
##  $ year             : Factor w/ 3 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
penguin_basicviz <- ggplot(data = penguins,
                           mapping = aes(x = year,
                                         y = bill_depth_mm,
                                         color = sex)) +
  geom_violin() +
  facet_wrap(vars(species)) +
  custom_theme()

penguin_basicviz

This basic visualization shows that bill depth is larger for males than females, and smaller for the Gentoo species compared to Adelie and Chinstrap species. There is some variance across year but this isn’t a metric of age, rather a metric of sampling date - so I won’t put too much stock into this.

Step 2:

Fit the model and check assumptions. Be sure to be careful of linearity. Do we meet assumptions here?

penguin_lm <- lm(bill_depth_mm ~ species + sex + year,
                 data = penguins)

check_model(penguin_lm)

The PPC is nearly perfect. It is hard to interpret the points on the linearity and HOV plots, but the reference lines are flat and horizontal. Everything looks good so far.

Step 3:

Great! Now - what does it all mean? Use the coefficients and/or the expected means to evaluate which of these predictors appear to influence bill depth.

sex_em <- emmeans(penguin_lm,
          specs = ~sex) %>% 
  contrast(method = "pairwise") %>% 
  confint()

year_em <- emmeans(penguin_lm,
           specs = ~year) %>% 
  contrast(method = "pairwise") %>% 
  confint()

species_em <- emmeans(penguin_lm,
              specs = ~species) %>% 
  contrast(method = "pairwise") %>% 
  confint()

sex_em
##  contrast      estimate     SE  df lower.CL upper.CL
##  female - male     -1.5 0.0911 327    -1.68    -1.33
## 
## Results are averaged over the levels of: species, year 
## Confidence level used: 0.95
year_em
##  contrast            estimate    SE  df lower.CL upper.CL
##  year2007 - year2008   0.2105 0.114 327  -0.0573    0.478
##  year2007 - year2009   0.1442 0.112 327  -0.1205    0.409
##  year2008 - year2009  -0.0664 0.110 327  -0.3248    0.192
## 
## Results are averaged over the levels of: species, sex 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
species_em
##  contrast           estimate    SE  df lower.CL upper.CL
##  Adelie - Chinstrap  -0.0565 0.122 327   -0.344    0.231
##  Adelie - Gentoo      3.3638 0.103 327    3.122    3.605
##  Chinstrap - Gentoo   3.4202 0.127 327    3.121    3.719
## 
## Results are averaged over the levels of: sex, year 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
r2(penguin_lm)
## # R2 for Linear Regression
##        R2: 0.825
##   adj. R2: 0.822

Males have larger bill depths than females. Gentoo penguins have markedly smaller bill depths than Adelie or Chinstrap penguins, which do not clearly differ from one another (the Adelie–Chinstrap confidence interval spans zero). The year contrasts are all small and their confidence intervals all include zero, so there is no clear evidence of differences across years.

The R2 for the model is 0.825, indicating that a large portion (but not all) of the variance in bill depth is explained by the included predictors.

Step 4:

How would you visualize what you learned in 2.3 to communicate it to a reader in a paper? Make the relevant plots, and make it fancy! (What? I’m a fan of the series).

library(visreg)
library(ggpubr)

sex_visreg <- visreg(penguin_lm,
                     "sex",
                     gg = TRUE) +
  custom_theme() +
  labs(x = "Sex",
       y = "Bill Depth (mm)")
sex_visreg

species_visreg <- visreg(penguin_lm,
                         "species",
                         gg = TRUE) +
  custom_theme() +
  labs(x = "Species",
       y = "Bill Depth (mm)")
species_visreg

year_visreg <- visreg(penguin_lm,
                      "year",
                      gg = TRUE) +
  custom_theme() +
  labs(x = "Year",
       y = "Bill Depth (mm)")

figure <- ggarrange(species_visreg,
                    sex_visreg,
                    year_visreg,
                    labels = c("A", "B", "C"),
                    ncol = 3, nrow = 1)
figure

figure_text <- paste("Figure X: A linear model was generated to examine the influence of species,",
                     "sex, and year on penguin bill depth. (A) Bill depth was plotted by penguin species.", 
                     "Gentoo penguins exhibited the lowest bill depth. (B) Bill depth was plotted by",
                     "sex. Males exhibited larger bill depths than females. (C) Bill depth was plotted",
                     "by year. Marginal differences were observed across years, with 2007 demonstrating",
                     "the largest bill depth.")

figure_para <- ggparagraph(text = figure_text,
                                  size = 10,
                                  color = "black") # format description

figure_final <- ggarrange(figure,
                          figure_para,
                          ncol = 1,
                          nrow = 2)

figure_final

Question 3: Comparing Means with Covariates

We will wrap up with a model mixing continuous and discrete variables. And you’ve seen it before! In this dataset from Scantlebury et al., the authors explored how caste affected the energy level of naked mole rats. BUT - castes might differ by mass as well!

Step 1:

Load and plot the data. Look at how each pair of variables relates to one another here. Is there a chance our covariate might be important? Impress yourself by making them super fancy! Geom to your heart’s content.

rats <- read.csv("/Users/nathanstrozewski/Documents/Everything/Adult Stuff/Education/M.S. Biology UMB/Courses/003_F2022/Biological Data Analysis/Homework/Homework #7/Data/18e4MoleRatLayabouts.csv")

str(rats)
## 'data.frame':    35 obs. of  3 variables:
##  $ caste   : chr  "worker" "worker" "worker" "worker" ...
##  $ lnmass  : num  3.85 3.99 4.11 4.17 4.25 ...
##  $ lnenergy: num  3.69 3.69 3.69 3.66 3.87 ...
rat_plot <- ggplot(data = rats,
                   mapping = aes(x = lnmass,
                                 y = lnenergy,
                                 color = caste)) +
  geom_point() +
  geom_line() +
  custom_theme()

rat_plot

Castes clearly vary in mass. It is hard to discern the relationship between energy and mass here. Maybe workers have higher energy?

Step 2:

Fit a model with BOTH sets of predictors, using least squares and evaluate all relevant assumptions. List them out as you test them. What new assumption are we testing here? Can we use this model? If not, fix it. But if we can, no fix is needed!

rat_lm <- lm(lnenergy ~ lnmass + caste,
             data = rats)

check_model(rat_lm)

residualPlots(rat_lm, test=FALSE)

check_predictions(rat_lm) %>%  plot()

check_normality(rat_lm) %>%  plot(type = "qq")

check_outliers(rat_lm) %>%  plot(type = "bar")

The PPC looks pretty good. There is some waviness in the HOV plot, so I can try an asinh transformation.

Looking at the residuals, there is a bow shape in the residuals vs. fitted plot, which might also call for a transformation. A closer look at the PPC indicates the model is a good-ish fit to the observed data - in line, but not an exact match. Normality looks good, and there are no outliers.

rat_asinh_lm <- lm(asinh(lnenergy) ~ lnmass + caste,
                   data = rats)

check_model(rat_asinh_lm)

Little or no difference in HOV when asinh-transformed. I’m going to use the original model.

Step 3:

Compare the two castes’ energy expenditure at the mean level of log mass. Are they different? How would you discuss your conclusions?

emmeans(rat_lm,
        specs = ~caste) %>% 
  contrast(method = "pairwise") %>% 
  confint()
##  contrast      estimate    SE df lower.CL upper.CL
##  lazy - worker   -0.393 0.146 32   -0.691  -0.0957
## 
## Confidence level used: 0.95

Lazy mole rats have a lower energy expenditure than worker mole rats (estimate = -0.393). A better way to say this:

Energy expenditure varies by mole rat caste, with worker mole rats expending more energy than lazy mole rats.
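Since the response is ln(energy), the contrast can also be put on the original energy scale by exponentiating - a quick back-of-envelope using the estimate from the emmeans output above:

```r
# The caste contrast (-0.393) is a difference in ln(energy), so
# exponentiating converts it to a multiplicative effect on energy.
est   <- -0.393        # lazy - worker, from the emmeans table
ratio <- exp(est)      # ~0.68: lazy rats expend ~68% of worker energy
1 / ratio              # ~1.48: workers expend roughly 1.5x as much
```

Reporting the effect as a ratio (“workers expend ~1.5× the energy of lazy rats at the same mass”) is often easier for readers than a difference in log units.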

Step 4:

Plot the fit model with the fit confidence interval on top of the data. Use modelr::data_grid() and broom::augment().

Impress yourself! Do the same thing, but backtransforming both your predicted values AND lnmass (so, plotting mass on the x and energy on the y) with CIs. Note - this is not a trick, and should only require some basic mutates (or advanced, if you want to dig into across() and where() - which, well, you should check them out!). Do you learn something different from this plot? Note, a log-log relationship corresponds to the following: if y = kx^n, then log(y) = log(k) + n log(x).
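The back-transform in that hint can be sketched with made-up coefficients (the slope and intercept below are illustrative, not the fitted values): exponentiating a log-log prediction recovers the power law on the original scale.

```r
# A line on the (lnmass, lnenergy) scale, lnenergy = log(k) + n * lnmass,
# is the power law energy = k * mass^n once both axes are exponentiated.
# Slope and intercept here are made up for illustration.
n   <- 0.75                       # hypothetical slope
lnk <- 0.5                        # hypothetical intercept, log(k)

lnmass   <- log(c(50, 100, 200))
lnenergy <- lnk + n * lnmass      # prediction on the log-log scale

mass   <- exp(lnmass)             # back-transform the x axis
energy <- exp(lnenergy)           # back-transform the prediction (and CIs)

stopifnot(all.equal(energy, exp(lnk) * mass^n))  # matches k * mass^n
```

The same exp() mutate applies to .fitted, .lower, and .upper from augment(), which is all the “basic mutates” the prompt alludes to.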

library(modelr)
## 
## Attaching package: 'modelr'
## The following objects are masked from 'package:performance':
## 
##     mae, mse, rmse
## The following object is masked from 'package:broom':
## 
##     bootstrap
energy_predictions <- data_grid(rats,
                                lnmass = seq_range(lnmass, 100),
                                caste = caste) %>% 
  add_predictions(model = rat_lm, var = "lnenergy") %>% 
  augment(rat_lm,
          newdata = .,
          interval = "confidence")

energy_predictions_plot <- ggplot(data = rats,
                                  mapping = aes(x = lnmass,
                                                y = lnenergy)) +
  geom_point() +
  geom_line(data = energy_predictions,
            mapping = aes(group = caste)) +
  geom_ribbon(data = energy_predictions,
              mapping = aes(group = caste,
                            ymin = .lower,
                            ymax = .upper),
              fill = "lightgrey",
              color = NA,
              alpha = 0.25) +
    facet_wrap(vars(caste)) +
    custom_theme() +
  labs(title = "Influence of Mass and Caste on Energy Expenditure",
       subtitle = "In Naked Mole Rats",
       x = "Mass",
       y = "Energy Expenditure")

energy_predictions_plot

Meta Questions:

Question 1:

Where do you think we will go next with models like these?

Since I am completing this a couple weeks behind, I am cheating a bit here. The next logical steps are to look at multiple predictors (including multiple categorical predictors) in more depth, looking at interaction effects, and dealing with more advanced models (such as GLMs).

Question 2:

How do you think you will use models like these in your own work?

I can apply some basic models to my work. For example, I can model how the level of my protein of interest, Capicua, influences wing size, wing vein length, and wing vein arborization. The utility of these models in my work is pretty limited.

I think there are interesting ways to apply these models to my hobby - baseball data analysis. I would like to model how pitch factors (such as spin rate, speed, vertical break, horizontal break, etc.) influence pitch outcome (strike outs, hits, home runs, etc.).

Question 3:

What about what we did this week was the most difficult for you to understand? Where are you feeling the boundaries of your understanding pushed the most?

The weeks in this stretch build nicely on one another. I did not particularly struggle with anything this week. After completing the midterm, this work feels very easy now.

Question 4:

How much time did this take you, roughly? Again, I’m trying to keep track that these assignments aren’t killer, more than anything.

This took me ~ 45-60 min. It probably would have taken longer, but the midterm made me much more efficient with these types of tasks.

Question 5:

Please give yourself a weak/sufficient/strong assessment on this assignment. Feel free to comment on why.

Strong. I feel really comfortable with the process to create and analyze models, as well as the logic behind these steps. This is something that required a bit of practice, and having weeks 5-9 build on each other (plus the midterm) really helped drill these home.